Abstract. The $k$-support norm has recently been introduced to perform
correlated sparsity regularization [1]. Although Argyriou et al. only
reported experiments using squared loss, here we apply it to several
other commonly used settings, resulting in novel machine learning
algorithms with interesting and familiar limit cases. Source code for the
algorithms described here is available from
https://github.com/blaschko/ksupport
1 The $k$-support Norm

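The $k$-support norm is the gauge function associated with the convex set

$$\operatorname{conv}\{\beta \mid \|\beta\|_0 \le k,\ \|\beta\|_2 \le 1\}. \tag{1}$$

It can be computed as

$$\|\beta\|_k^{sp} = \left( \sum_{i=1}^{k-r-1} \left(|\beta|_i^{\downarrow}\right)^2 + \frac{1}{r+1} \left( \sum_{i=k-r}^{d} |\beta|_i^{\downarrow} \right)^2 \right)^{1/2} \tag{2}$$

where $|\beta|_i^{\downarrow}$ is the $i$th largest element of the vector of absolute
values $|\beta|$ and $r$ is the unique integer in $\{0, \dots, k-1\}$ satisfying

$$|\beta|_{k-r-1}^{\downarrow} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\beta|_i^{\downarrow} \ge |\beta|_{k-r}^{\downarrow}. \tag{3}$$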

We use the following notation here: $X \in \mathbb{R}^{n \times d}$ is a design matrix of $n$
samples, each with $d$ dimensions; $y \in \mathbb{R}^n$ is the vector of targets.



In the case that $k = 1$ the $k$-support norm is exactly equivalent to the $\ell_1$
norm. In the case that $k = d$, where $\beta \in \mathbb{R}^d$, the $k$-support norm is
equivalent to the $\ell_2$ norm.
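Equations (2) and (3) translate fairly directly into code. The following is a minimal sketch, not the author's implementation: the function name `k_support_norm` is hypothetical, and the convention that $|\beta|_0^{\downarrow} = +\infty$ (needed when $k - r - 1 = 0$) is an assumption taken from the usual statement of the norm in [1] rather than from the text above.

```python
import numpy as np

def k_support_norm(beta, k):
    """Evaluate the k-support norm by searching for the integer r of Equation (3)."""
    b = np.sort(np.abs(np.asarray(beta, dtype=float)))[::-1]  # |beta| sorted descending
    d = b.size
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    for r in range(k):
        # 1-based positions k-r-1 and k-r correspond to 0-based k-r-2 and k-r-1.
        tail = b[k - r - 1:].sum()                            # sum_{i=k-r}^{d} |beta|_i
        upper = np.inf if k - r - 1 == 0 else b[k - r - 2]    # assumed |beta|_0 = +inf
        lower = b[k - r - 1]
        if upper > tail / (r + 1) >= lower:                   # Equation (3)
            head = np.sum(b[:k - r - 1] ** 2)
            return float(np.sqrt(head + tail ** 2 / (r + 1)))  # Equation (2)
    raise RuntimeError("no r satisfied Equation (3); check for rounding issues")

v = np.array([3.0, -1.0, 0.5, 2.0])
print(k_support_norm(v, 1), np.abs(v).sum())     # k = 1 matches the l1 norm
print(k_support_norm(v, 4), np.linalg.norm(v))   # k = d matches the l2 norm
```

The linear search over $r$ costs $O(kd)$ after sorting; it is chosen here for readability rather than speed.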

We note that for an objective

$$\min_\beta\ \lambda\|\beta\|_k^{sp} + f(\beta, X, y) \tag{4}$$

with some loss function $f(\cdot, \cdot, \cdot)$, when $k = d$, this is equivalent to

$$\min_\beta\ \lambda\|\beta\|_2 + f(\beta, X, y) \tag{5}$$

rather than the familiar squared $\ell_2$ regularizer. However, for any $\lambda$ there
exists some $\tilde{\lambda}$ such that

$$\arg\min_\beta\ \lambda\|\beta\|_2 + f(\beta, X, y) = \arg\min_\beta\ \tilde{\lambda}\|\beta\|_2^2 + f(\beta, X, y). \tag{6}$$

This can be easily seen by noting that the objectives are the Lagrangians of
constrained minimization problems that minimize $f$ subject to the equivalent
constraints $\|\beta\|_2 \le B$ and $\|\beta\|_2^2 \le B^2$, respectively, for some $B \in \mathbb{R}_+$.
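One way to make the correspondence in Equation (6) concrete is to compare stationarity conditions. The sketch below assumes that $f$ is convex and differentiable and that the minimizer $\beta^\star$ of the unsquared problem is nonzero; both are assumptions of this illustration, not statements from the text.

```latex
\begin{align*}
0 &= \nabla f(\beta^\star) + \lambda \frac{\beta^\star}{\|\beta^\star\|_2}
  && \text{(stationarity for } \lambda\|\beta\|_2 + f\text{)} \\
0 &= \nabla f(\beta^\star) + 2\tilde{\lambda}\,\beta^\star
  && \text{(stationarity for } \tilde{\lambda}\|\beta\|_2^2 + f\text{)} \\
\Rightarrow\quad \tilde{\lambda} &= \frac{\lambda}{2\,\|\beta^\star\|_2}.
\end{align*}
```

Under these assumptions the matching $\tilde{\lambda}$ depends on the solution itself, so it is generally not known before solving; in practice both parameters would be chosen by model selection.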



2 Squared Loss 

If we use Nesterov's accelerated method (a first-order proximal algorithm) for
optimization as suggested in [1], a given implementation of $k$-support regularized
risk requires a function that computes the loss $f$, a function that computes the
gradient of the loss function $\frac{\partial f}{\partial \beta}$, and the Lipschitz constant $L$ for
$\frac{\partial f}{\partial \beta}$. We assume that $f$ is convex and differentiable everywhere and that
$L$ is finite.
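To illustrate how the loss gradient and the Lipschitz constant $L$ are consumed, here is a minimal FISTA-style accelerated proximal gradient loop. It is a sketch under assumptions: the exact scheme used in [1] may differ in detail, the names `accelerated_proximal_gradient` and `soft_threshold` are hypothetical, and the $\ell_1$ soft-thresholding operator is only a stand-in for the proximal operator of the $k$-support regularizer, which is not reproduced here.

```python
import numpy as np

def accelerated_proximal_gradient(loss_grad, prox, L, beta0, n_iter=500):
    """Gradient step of size 1/L on the loss, then a proximal step, with momentum."""
    beta = np.asarray(beta0, dtype=float).copy()
    z, t = beta.copy(), 1.0
    for _ in range(n_iter):
        beta_next = prox(z - loss_grad(z) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Extrapolate along the last displacement (the acceleration step).
        z = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        beta, t = beta_next, t_next
    return beta

def soft_threshold(lam):
    """Stand-in prox for lam * ||.||_1; NOT the k-support proximal operator."""
    return lambda v, step: np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)

# Usage: minimize ||beta - y||^2 + ||beta||_1, whose gradient has L = 2.
y = np.array([3.0, -0.2, 1.0])
solution = accelerated_proximal_gradient(lambda b: 2.0 * (b - y),
                                         soft_threshold(1.0), 2.0, np.zeros(3))
print(solution)
```

Swapping in a different loss only changes `loss_grad` and `L`, which is why each section below reports exactly those quantities.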
For the squared loss:

$$f_2(\beta, X, y) = \|X\beta - y\|^2 \tag{7}$$

$$\frac{\partial f_2}{\partial \beta} = 2X^T X\beta - 2X^T y \tag{8}$$

$$L_2 = 2\gamma \tag{9}$$

where $\gamma$ is the largest eigenvalue of $X^T X$.
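Equations (7) to (9) can be sketched in NumPy as follows. The function names are hypothetical and are not taken from the released source code.

```python
import numpy as np

def squared_loss(beta, X, y):
    """Equation (7): squared Euclidean norm of the residual."""
    r = X @ beta - y
    return float(r @ r)

def squared_loss_grad(beta, X, y):
    """Equation (8): 2 X^T X beta - 2 X^T y."""
    return 2.0 * X.T @ (X @ beta - y)

def squared_loss_lipschitz(X):
    """Equation (9): twice the largest eigenvalue of X^T X."""
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return 2.0 * float(np.linalg.eigvalsh(X.T @ X)[-1])

rng = np.random.default_rng(0)
X_demo, y_demo = rng.normal(size=(8, 3)), rng.normal(size=8)
print(squared_loss(np.zeros(3), X_demo, y_demo), squared_loss_lipschitz(X_demo))
```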
The objective function

$$\lambda\|\beta\|_k^{sp} + \|X\beta - y\|^2 \tag{10}$$

clearly has the lasso [11] and ridge regression [12] as special cases when $k = 1$
and $k = d$, respectively. Argyriou et al. [1] have previously discussed the
relationship to the elastic net [17]. The $k$-support norm with squared loss has
been shown to give good results on fMRI data [7].



3 One Sided Squared Loss 



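While we have previously assumed that $y \in \mathbb{R}^n$, here we will assume we are
dealing with the binary classification case where $y \in \{-1, +1\}^n$. One sided
squared loss simply computes the squared loss when a margin is violated, and
zero otherwise.

$$f_{2-}(\beta, X, y) = \sum_{i=1}^{n} \left(\max\{0,\ 1 - y_i\langle\beta, x_i\rangle\}\right)^2 \tag{11}$$

$$\frac{\partial f_{2-}}{\partial \beta} = \sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i\langle\beta, x_i\rangle \ge 1 \\ 2\langle\beta, x_i\rangle x_i - 2y_i x_i & \text{if } y_i\langle\beta, x_i\rangle < 1 \end{cases} \tag{12}$$

$$L_{2-} = 2\gamma \tag{13}$$

One sided squared loss has been considered, for example, in [4].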
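A short NumPy sketch of Equations (11) and (12) follows, assuming labels in $\{-1, +1\}$ as in the text. The function names are hypothetical.

```python
import numpy as np

def one_sided_squared_loss(beta, X, y):
    """Equation (11): squared slack on violated margins, zero elsewhere."""
    slack = np.maximum(0.0, 1.0 - y * (X @ beta))
    return float(slack @ slack)

def one_sided_squared_loss_grad(beta, X, y):
    """Equation (12): only samples with y_i <beta, x_i> < 1 contribute."""
    scores = X @ beta
    active = y * scores < 1.0
    # With y_i in {-1, +1}: 2 <beta, x_i> x_i - 2 y_i x_i on the active set.
    return 2.0 * X[active].T @ (scores[active] - y[active])

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(10, 4))
y_demo = np.where(rng.normal(size=10) > 0, 1.0, -1.0)
print(one_sided_squared_loss(np.zeros(4), X_demo, y_demo))  # every margin violated at beta = 0
```

When every margin is violated the loss coincides with the squared loss of Section 2, which is why the same constant $2\gamma$ serves as a Lipschitz bound.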



4 Hinge Loss 

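Hinge loss is not differentiable, so we apply a Huber approximation to hinge
loss [4].¹ The Huber parameter is denoted $h$:

$$f_h(\beta, X, y) = \sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i\langle\beta, x_i\rangle > 1 + h \\ \dfrac{(1 + h - y_i\langle\beta, x_i\rangle)^2}{4h} & \text{if } |1 - y_i\langle\beta, x_i\rangle| \le h \\ 1 - y_i\langle\beta, x_i\rangle & \text{if } y_i\langle\beta, x_i\rangle < 1 - h \end{cases} \tag{14}$$

$$\frac{\partial f_h}{\partial \beta} = \sum_{i=1}^{n} \begin{cases} 0 & \text{if } y_i\langle\beta, x_i\rangle > 1 + h \\ \dfrac{\langle\beta, x_i\rangle x_i - (1 + h) y_i x_i}{2h} & \text{if } |1 - y_i\langle\beta, x_i\rangle| \le h \\ -y_i x_i & \text{if } y_i\langle\beta, x_i\rangle < 1 - h \end{cases} \tag{15}$$

$$L_h = \frac{\gamma}{2h} \tag{16}$$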

where $\gamma$ is, as before, the largest eigenvalue of $X^T X$. We note that the
Lipschitz constant is in a sense conservative in that it grows with the inverse
of $h$, while we might expect a correspondingly smaller fraction of the data to
actually fall within the quadratic portion of the loss. Nevertheless, for $h$ not
too small, we have not observed any convergence issues with Nesterov's
accelerated method. While a small value of $h$ may be desirable in a kernelized
setting, here we desire hinge loss not for sparsity of a dual coefficient vector
(indeed the $k$-support norm does not admit a representer theorem [5]), but
rather so that the loss does not grow more than linearly while remaining convex.
In other words, we use the hinge loss primarily for its increased robustness
over other losses such as (one-sided) squared loss.
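The Huber smoothed hinge loss, its gradient, and its Lipschitz constant can be sketched as below. The piecewise constants $4h$ and $2h$ follow from requiring the quadratic piece to join the linear piece continuously at margin $1 - h$; the function names and the default `h=0.1` are assumptions of this sketch.

```python
import numpy as np

def huber_hinge_loss(beta, X, y, h=0.1):
    """Huber smoothed hinge loss summed over samples."""
    m = y * (X @ beta)                        # margins y_i <beta, x_i>
    quad = (1.0 + h - m) ** 2 / (4.0 * h)     # used where |1 - m| <= h
    lin = 1.0 - m                             # used where m < 1 - h
    per_sample = np.where(m > 1.0 + h, 0.0, np.where(m < 1.0 - h, lin, quad))
    return float(np.sum(per_sample))

def huber_hinge_grad(beta, X, y, h=0.1):
    """Gradient with respect to beta, via the derivative in the margin."""
    m = y * (X @ beta)
    # d(loss_i)/d(m_i) is 0, (m - 1 - h) / (2h), or -1 in the three regions.
    dm = np.where(m > 1.0 + h, 0.0,
                  np.where(m < 1.0 - h, -1.0, (m - 1.0 - h) / (2.0 * h)))
    return X.T @ (dm * y)

def huber_hinge_lipschitz(X, h=0.1):
    """Largest eigenvalue of X^T X divided by 2h."""
    return float(np.linalg.eigvalsh(X.T @ X)[-1]) / (2.0 * h)

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(15, 4))
y_demo = np.where(rng.normal(size=15) > 0, 1.0, -1.0)
print(huber_hinge_loss(np.zeros(4), X_demo, y_demo), huber_hinge_lipschitz(X_demo))
```

Per sample, the smoothed loss differs from the plain hinge loss by at most $h/4$, attained at margin exactly 1.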



1 Although it is perhaps more natural to incorporate non-differentiable losses with the
$k$-support regularizer in a proximal splitting approach, we have arbitrarily closely
approximated non-differentiable losses by differentiable ones for the sake of uniformity
of presentation and software implementation.



The limit cases are the support vector machine (SVM) [5] when $k = d$ and
the $\ell_1$ regularized SVM [16] when $k = 1$. The $k$-support regularized SVM can be
seen as an alternative to the elastic net regularized SVM P3], but with a tighter
convex relaxation to correlated sparsity (Equation (1)).



5 Logistic Loss 

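Logistic loss is derived from logistic regression, and its minimization is equivalent
to logistic regression in the case that it is unregularized [5].

$$f_{\log}(\beta, X, y) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i\langle\beta, x_i\rangle}\right) \tag{17}$$

$$\frac{\partial f_{\log}}{\partial \beta} = -\sum_{i=1}^{n} \frac{e^{-y_i\langle\beta, x_i\rangle}}{1 + e^{-y_i\langle\beta, x_i\rangle}}\, y_i x_i \tag{18}$$

$$L_{\log} = \frac{\gamma}{4} \tag{19}$$

where the Lipschitz constant has a factor $\frac{1}{4}$ from the Lipschitz constant of the
sigmoid in $\frac{\partial f_{\log}}{\partial \beta}$. $k$-support regularized logistic regression specializes to
previously used regularized logistic regression objectives [9] when $k = 1$ or $k = d$.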
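Equations (17) to (19) can be sketched in a numerically careful way as follows. The function names are hypothetical; the use of `np.logaddexp` and of a `tanh` identity to avoid overflow is a choice of this sketch rather than something stated in the text.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Equation (17): sum of log(1 + exp(-y_i <beta, x_i>))."""
    m = y * (X @ beta)
    return float(np.sum(np.logaddexp(0.0, -m)))   # overflow-safe log(1 + exp(-m))

def logistic_loss_grad(beta, X, y):
    """Equation (18), using exp(-m) / (1 + exp(-m)) = 0.5 * (1 - tanh(m / 2))."""
    m = y * (X @ beta)
    w = 0.5 * (1.0 - np.tanh(0.5 * m))
    return -X.T @ (w * y)

def logistic_lipschitz(X):
    """Equation (19): largest eigenvalue of X^T X divided by 4."""
    return float(np.linalg.eigvalsh(X.T @ X)[-1]) / 4.0

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(12, 3))
y_demo = np.where(rng.normal(size=12) > 0, 1.0, -1.0)
print(logistic_loss(np.zeros(3), X_demo, y_demo))  # equals 12 * log(2) at beta = 0
```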

6 Exponential Loss 

Exponential loss is known primarily through its use in AdaBoost.M1 [6, 8].

$$f_{\exp}(\beta, X, y) = \sum_{i=1}^{n} e^{-y_i\langle\beta, x_i\rangle} \tag{20}$$

$$\frac{\partial f_{\exp}}{\partial \beta} = -\sum_{i=1}^{n} e^{-y_i\langle\beta, x_i\rangle}\, y_i x_i \tag{21}$$

Here, the loss is not globally Lipschitz continuous. However, one may attempt
to estimate a sufficiently large constant if one were to apply learning with the
$k$-support norm and Nesterov's accelerated method (we have simply used a
relatively conservative $50 \times \gamma$ in the experiments reported in Section 8). As
exponential loss is highly degenerate in the presence of label noise (essentially
for the same reason that it is not globally Lipschitz continuous), this is likely
of limited utility in real-world applications. We have included this loss here
primarily for completeness, and have not explored any other optimization
strategies.
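A sketch of Equations (20) and (21), together with the heuristic constant of $50 \times \gamma$ mentioned above, is given below. The function names are hypothetical, and `heuristic_step_constant` is only a stand-in for a true Lipschitz constant, which does not exist globally for this loss.

```python
import numpy as np

def exponential_loss(beta, X, y):
    """Equation (20): sum of exp(-y_i <beta, x_i>)."""
    return float(np.sum(np.exp(-y * (X @ beta))))

def exponential_loss_grad(beta, X, y):
    """Equation (21): -sum of exp(-y_i <beta, x_i>) y_i x_i."""
    return -X.T @ (np.exp(-y * (X @ beta)) * y)

def heuristic_step_constant(X, factor=50.0):
    """factor * gamma, with gamma the largest eigenvalue of X^T X (a heuristic, not a bound)."""
    return factor * float(np.linalg.eigvalsh(X.T @ X)[-1])

X_demo = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y_demo = np.array([1.0, -1.0, 1.0])
print(exponential_loss(np.zeros(2), X_demo, y_demo), heuristic_step_constant(X_demo))
```

Because the curvature grows exponentially along badly misclassified samples, a fixed constant of this kind can be too small far from the solution; a backtracking line search would be one alternative, which the text does not explore.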

7 $\varepsilon$-insensitive Loss and Huber Smoothed Absolute Loss

$\varepsilon$-insensitive loss is defined to be [13]:

$$|y_i - \langle\beta, x_i\rangle|_\varepsilon := \max\{0,\ |y_i - \langle\beta, x_i\rangle| - \varepsilon\} \tag{22}$$



Fig. 1. Huber smoothed $\varepsilon$-insensitive loss. (a) $\varepsilon = 2$ gives an insensitive
region around the correct regression value. (b) The special case that $\varepsilon = 0$
results in a Huber smoothed absolute loss. The horizontal axis is
$y_i - \langle\beta, x_i\rangle$, ranging from $-5$ to $5$, while the vertical axis plots $f_\varepsilon$ in
blue and its gradient in red. Both plots use the same value of $h$.



Table 1. Accuracies for each method and regularizer. See text for the experimental
setting. The $k$-support norm achieved higher accuracies on average for all loss functions.

Regularizer          $f_2$          $f_{2-}$       $f_h$          $f_{\log}$     $f_{\exp}$     $f_{abs}$      $f_\varepsilon$
$\|\beta\|_k^{sp}$   0.883 ± 0.058  0.883 ± 0.058  0.890 ± 0.057  0.889 ± 0.056  0.888 ± 0.060  0.889 ± 0.065  0.886 ± 0.062
$\|\beta\|_1$        0.870 ± 0.062  0.870 ± 0.062  0.868 ± 0.069  0.872 ± 0.063  0.876 ± 0.065  0.870 ± 0.077  0.879 ± 0.059
$\|\beta\|_2$        0.871 ± 0.071  0.871 ± 0.071  0.872 ± 0.065  0.872 ± 0.066  0.870 ± 0.067  0.867 ± 0.071  0.872 ± 0.063




The definition in Equation (22) holds for some parameter $\varepsilon > 0$. While $\varepsilon$
has an important role in the sparsity of the dual representation for support
vector regression [15], that role is not required in the primal. As with hinge
loss, we use Huber smoothing to guarantee differentiability.

$$\dots \quad \text{if } |y_i - \langle\beta, x_i\rangle + \varepsilon| \le h \tag{23}$$

$$\dots \tag{24}$$

$$\dots \tag{25}$$




Here we have decomposed the $\varepsilon$-insensitive loss into two hinge components to
emphasize the relationship to Huber smoothed hinge loss (cf. Section 4). A plot
of the loss and its gradient is shown in Figure 1. In the case that $\varepsilon = 0$ we get
a Huber smoothed absolute loss function as a special case (denoted $f_{abs}$ in the
sequel), and the curvature of the loss function at $y_i - \langle\beta, x_i\rangle = 0$ is doubled;
therefore the Lipschitz constant is double that of the one sided hinge loss.
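The two-component decomposition described here can be sketched as the sum of two shifted Huber smoothed hinges, one per side of the insensitive region. The exact piecewise form below is a reconstruction and an assumption: it was chosen to agree with the case condition $|y_i - \langle\beta, x_i\rangle + \varepsilon| \le h$, with footnote 2, and with the doubled Lipschitz constant, but it is not copied from Equations (23) to (25). The function names and the defaults `eps=1.0` and `h=0.1` are likewise hypothetical.

```python
import numpy as np

def _smoothed_hinge(u, h):
    """Huber smoothed max(0, u): 0 for u < -h, (u + h)^2 / (4h) for |u| <= h, u for u > h."""
    return np.where(u < -h, 0.0, np.where(u > h, u, (u + h) ** 2 / (4.0 * h)))

def _smoothed_hinge_deriv(u, h):
    """Derivative of _smoothed_hinge with respect to u."""
    return np.where(u < -h, 0.0, np.where(u > h, 1.0, (u + h) / (2.0 * h)))

def eps_insensitive_loss(beta, X, y, eps=1.0, h=0.1):
    """Sum of two smoothed hinges, one for each side of the insensitive tube."""
    r = y - X @ beta                                  # residuals y_i - <beta, x_i>
    return float(np.sum(_smoothed_hinge(r - eps, h) + _smoothed_hinge(-r - eps, h)))

def eps_insensitive_grad(beta, X, y, eps=1.0, h=0.1):
    r = y - X @ beta
    dr = _smoothed_hinge_deriv(r - eps, h) - _smoothed_hinge_deriv(-r - eps, h)
    return -X.T @ dr                                  # chain rule: d r_i / d beta = -x_i

def eps_insensitive_lipschitz(X, h=0.1):
    """Double the smoothed hinge constant: largest eigenvalue of X^T X divided by h."""
    return float(np.linalg.eigvalsh(X.T @ X)[-1]) / h

X_demo = np.eye(3)
y_demo = np.array([0.5, 3.0, -3.0])
print(eps_insensitive_loss(np.zeros(3), X_demo, y_demo))  # residual 0.5 lies inside the tube
```

Under this reconstruction, setting `eps=0` gives $r^2/(2h) + h/2$ for $|r| \le h$ and $|r|$ outside, so the curvature at zero residual is $1/h$, twice that of a single smoothed hinge.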

In the case that $k = d$, we recover the special case of $\varepsilon$-support vector
regression ($\varepsilon$-SVR) [10]. If we set $k = 1$ we get an $\ell_1$ regularized variant of
$\varepsilon$-SVR. In the case that $\varepsilon = 0$ this $\ell_1$ regularized variant is equivalent to
regularized least absolute deviations regression [TS]. In Equation (23), $\varepsilon < 0$
is equivalent to $\varepsilon > 0$ but with a constant value added to the loss everywhere,
i.e. the minimizer is the same.



Table 2. Mean squared errors (MSE) for each method and regularizer. See text for the
experimental setting. $f_2$, $f_{2-}$, $f_{abs}$, and $f_\varepsilon$ achieved the lowest MSEs, with the
$k$-support norm regularizer giving best results on average.

Regularizer          $f_2$            $f_{2-}$         $f_h$            $f_{\log}$       $f_{\exp}$       $f_{abs}$        $f_\varepsilon$
$\|\beta\|_k^{sp}$   1.21e2 ± 4.89e1  1.21e2 ± 4.89e1  1.78e2 ± 1.00e2  3.33e3 ± 5.39e3  1.59e3 ± 2.89e3  1.25e2 ± 5.41e1  2.21e2 ± 1.51e1
$\|\beta\|_1$        1.25e2 ± 4.81e1  1.25e2 ± 4.81e1  2.21e2 ± 9.63e1  1.13e4 ± 9.89e3  6.16e3 ± 4.82e3  1.48e2 ± 1.76e2  2.16e2 ± 1.66e1
$\|\beta\|_2$        1.49e2 ± 4.75e1  1.49e2 ± 4.74e1  1.81e2 ± 7.66e1  4.18e3 ± 8.00e3  3.08e3 ± 4.88e3  1.50e2 ± 5.34e1  2.25e2 ± 1.56e1




8 Experiments 

We have applied each of the algorithms above to a toy classification problem
conceptually similar to that reported in [1]. In all cases, we perform model
selection for $k \in \{1, \dots, d\}$ and $\lambda = 10^i$, $i \in \{-15, \dots, 5\}$. We compare
additionally to the special fixed cases $k = 1$ and $k = d$ corresponding to $\ell_1$ and
$\ell_2$ regularization, respectively.

Output labels were generated randomly with equal probability. The first 15
dimensions were set by multiplying the label by a fixed vector of 15 samples from
a zero mean Gaussian and adding Gaussian noise (i.e. a noisy signal is contained
in the first 15 dimensions). The subsequent 50 dimensions were set to zero mean
Gaussian noise (i.e. the subsequent dimensions contain no signal and should be
ignored). 50 samples were used for training, 50 for validation, and 250 for testing.
Table 1 gives the mean accuracies for each method across 20 random problem
instances, while Table 2 gives the mean squared error (MSE). For $\varepsilon$-insensitive
loss we arbitrarily set $\varepsilon = 1$. For all methods with a Huber smoothing parameter,
we set h — jq. 
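The data generation and model selection grid described above can be sketched as follows. The text does not give the noise variances, so the unit standard deviations, the seed, and the names `make_toy_problem`, `signal`, `lambdas`, and `ks` are assumptions of this illustration.

```python
import numpy as np

def make_toy_problem(n, rng, signal, n_noise=50, noise_std=1.0):
    """Labels in {-1, +1}; signal dimensions carry label * signal + noise, the rest pure noise."""
    y = rng.choice([-1.0, 1.0], size=n)                       # equal probability labels
    informative = y[:, None] * signal[None, :] + noise_std * rng.normal(size=(n, signal.size))
    distractors = rng.normal(size=(n, n_noise))               # no signal, should be ignored
    return np.hstack([informative, distractors]), y

rng = np.random.default_rng(0)
signal = rng.normal(size=15)                  # fixed vector shared by all splits of one instance
X_train, y_train = make_toy_problem(50, rng, signal)
X_val, y_val = make_toy_problem(50, rng, signal)
X_test, y_test = make_toy_problem(250, rng, signal)

lambdas = 10.0 ** np.arange(-15, 6)           # lambda = 10^i for i in {-15, ..., 5}
ks = np.arange(1, X_train.shape[1] + 1)       # k in {1, ..., d}
print(X_train.shape, X_test.shape, len(lambdas), len(ks))
```

Repeating this with 20 different seeds would give the 20 random problem instances over which the tables report means.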

It should be noted that several of the methods employed here for classification
were developed for regression (squared loss, absolute loss, and $\varepsilon$-insensitive loss).
The experiments performed here were done primarily to validate their correct
implementation.

9 Conclusions 

We have described and implemented a large number of loss functions for
non-differentiably regularized risk optimization with proximal splitting methods.
These loss functions in combination with the $k$-support norm yield a large
number of learning algorithms proposed in the literature as special cases. Assuming
zero model error, each of these loss functions is sufficient to yield a statistically
consistent algorithm² (provided regularization goes to zero at a sufficient rate
as the number of samples goes to infinity) [3, Theorem 4]. However, their finite
sample behavior varies substantially. We hope that their implementation and
description in a common framework will facilitate their analysis and employment
in machine learning studies and applications.



2 Huber smoothed $\varepsilon$-insensitive loss requires that $\varepsilon - h < 1$ for consistency in the
binary classification setting, $y_i \in \{-1, +1\}$.